Introduce CONNECT threadpool #31546

DaveCTurner · 2018-06-24T19:27:45Z

Today we attempt to (re-)connect to our peers using the management threadpool.
However, during a network partition there may sometimes be a large number of
concurrent connection attempts. Connection attempts to partitioned nodes or to
nodes in containers that are no longer running can hang until they time out,
possibly blocking other reconnection attempts and other management activity for
an extended period of time. Moreover, connecting to a peer is a relatively
lightweight operation so it is reasonable to attempt a lot of them in parallel.

This change introduces a separate threadpool solely for connecting to peers.

Fixes #29023.
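
As a rough illustration of the motivation above, here is a minimal sketch using plain `java.util.concurrent` rather than Elasticsearch's `ThreadPool`. The class name, the pool sizes, and the latch standing in for a hung connection attempt are all assumptions made for this example. It shows that when connection attempts run on their own executor, a management task can still complete while every connection attempt is stuck.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;

public class SeparateConnectPool {

    // Returns true if a management task completes while all connection
    // attempts are still hung. Pool sizes here are illustrative only.
    static boolean managementStaysResponsive(int hungConnections) throws Exception {
        ExecutorService management = Executors.newFixedThreadPool(2);
        ExecutorService connect = Executors.newCachedThreadPool();
        // Stands in for "the partition heals or the attempt times out".
        CountDownLatch partitionHealed = new CountDownLatch(1);
        try {
            for (int i = 0; i < hungConnections; i++) {
                // Each simulated connection attempt blocks a connect thread only.
                connect.execute(() -> {
                    try {
                        partitionHealed.await();
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                });
            }
            // The management pool has no hung tasks queued, so this runs promptly.
            Future<String> result = management.submit(() -> "ok");
            return "ok".equals(result.get(5, TimeUnit.SECONDS));
        } finally {
            partitionHealed.countDown(); // release the simulated hung attempts
            connect.shutdown();
            management.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(managementStaysResponsive(8));
    }
}
```

If the hung attempts were instead submitted to the two-thread `management` pool, the management task would queue behind them until they were released, which is the blocking behaviour this change aims to avoid.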


elasticmachine · 2018-06-24T19:27:47Z

Pinging @elastic/es-core-infra

elasticmachine · 2018-06-24T19:27:48Z

Pinging @elastic/es-distributed

DaveCTurner · 2018-06-24T19:29:10Z

server/src/main/java/org/elasticsearch/threadpool/ThreadPool.java

@@ -186,6 +188,8 @@ public ThreadPool(final Settings settings, final ExecutorBuilder<?>... customBui
        builders.put(Names.FETCH_SHARD_STARTED, new ScalingExecutorBuilder(Names.FETCH_SHARD_STARTED, 1, 2 * availableProcessors, TimeValue.timeValueMinutes(5)));
        builders.put(Names.FORCE_MERGE, new FixedExecutorBuilder(settings, Names.FORCE_MERGE, 1, -1));
        builders.put(Names.FETCH_SHARD_STORE, new ScalingExecutorBuilder(Names.FETCH_SHARD_STORE, 1, 2 * availableProcessors, TimeValue.timeValueMinutes(5)));
+        builders.put(Names.CONNECT, new ScalingExecutorBuilder(Names.CONNECT, 1,
+            boundedBy(10 * availableProcessors, 10, 100), TimeValue.timeValueSeconds(10)));


NB I do not know if these numbers are reasonable.
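
To make the numbers in the diff concrete, the following sketch works out the maximum pool size for a few processor counts. It assumes `boundedBy(value, min, max)` clamps `value` into the range `[min, max]`; the helper is reimplemented here under that assumption, and `connectPoolMax` is a hypothetical name used only for this example.

```java
public class ConnectPoolSizing {

    // Assumed behaviour of boundedBy: clamp value into [min, max].
    static int boundedBy(int value, int min, int max) {
        return Math.min(max, Math.max(min, value));
    }

    // Hypothetical helper mirroring the expression in the diff.
    static int connectPoolMax(int availableProcessors) {
        return boundedBy(10 * availableProcessors, 10, 100);
    }

    public static void main(String[] args) {
        for (int processors : new int[] {1, 4, 16}) {
            System.out.println(processors + " processors -> at most "
                + connectPoolMax(processors) + " connect threads");
        }
    }
}
```

Under that assumption the pool scales from 1 thread up to 10 on a single-processor machine, 40 on a 4-processor machine, and is capped at 100 from 10 processors upwards, with idle threads released after the 10-second keep-alive.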

jasontedor · 2018-06-24T20:01:08Z

There is an open community PR for this: #30150.

DaveCTurner · 2018-06-24T20:22:43Z

So there is. I'll look at that in more detail tomorrow.

DaveCTurner added the >enhancement, :Distributed Coordination/Network, :Distributed Coordination/Cluster Coordination, v7.0.0, and v6.4.0 labels Jun 24, 2018

DaveCTurner requested review from bleskes and ywelsch June 24, 2018 19:27

DaveCTurner closed this Jun 24, 2018

lcawl added the >non-issue label Aug 22, 2018

colings86 added v7.0.0-beta1 and removed v7.0.0 labels Feb 7, 2019

DaveCTurner deleted the 2018-06-24-connect-threadpool branch July 23, 2022 10:44
